Where did my disks go?
Every now and then you may run into an issue that cannot be properly explained by just looking at the standard events that show up in "/var/log/messages".
Issues such as
Oct 7 18:24:20 centos8 kernel: lpfc 0000:81:00.0: 0:1305 Link Down Event xc received Data: xc x20 x800110 x0 x0
Oct 7 18:24:24 centos8 kernel: rport-11:0-4: blocked FC remote port time out: removing target and saving binding
Oct 7 18:24:24 centos8 kernel: lpfc 0000:81:00.0: 0:(0):0203 Devloss timeout on WWPN 50:06:0e:80:07:c3:70:00 NPort x01ee40 Data: x0 x8 x2
are fairly common, and the above simply shows a Link Down event. These are the easiest to troubleshoot when the remote switch log tells you
18:26:59.565715 SCN Port Offline;rsn=0x10004,g=0x12 A2,P0 A2,P0 93 NA
18:26:59.565721 *Removing all nodes from port A2,P0 A2,P0 93 NA
18:28:07.998318 SCN LR_PORT(0);g=0x12 A2,P0 A2,P0 93 NA
18:28:08.006029 SCN Port Online; g=0x12,isolated=0 A2,P0 A2,P1 93 NA
18:28:08.007307 Port Elp engaged A2,P1 A2,P0 93 NA
18:28:08.007331 *Removing all nodes from port A2,P0 A2,P0 93 NA
18:28:08.007594 SCN Port F_PORT A2,P1 A2,P0 93 NA
18:28:08.099107 SCN LR_PORT(0);g=0x12 LR_IN A2,P0 A2,P0 93 NA
18:28:20.669283 SCN Port Offline;rsn=0x10004,g=0x14 A2,P0 A2,P0 93 NA
18:28:20.669288 *Removing all nodes from port A2,P0 A2,P0 93 NA
as a result of
Wed Oct 7 18:28:07 2020 admin, FID 43, 10.75.27.192, portenable 4/29
Wed Oct 7 18:28:20 2020 admin, FID 43, 10.75.27.192, portdisable 4/29
Diagnosis becomes more problematic when there are only events that show the links bouncing, without any further information. Obtaining extended information from the HBA drivers may then be very helpful.
Update Drivers and Firmware
As you know I'm very picky when it comes to maintenance. If I see cases where system and/or storage administrators have basically been slacking for a long time, chances are very high that I will tell you so and only commence diagnosing issues once everything is up to date. You don't want to know the sheer number of issues that get resolved in firmware and drivers over any given time period.
That being said, let's move to the Linux side of the Emulex (or Broadcom) drivers for the LPe31000/LPe32000 cards, which are very popular and come in many form factors.
The driver shows up as the lpfc module and is by default included in the initramfs image when installed. This allows the card to be used in a boot-from-SAN setup if needed. The module loads as follows and registers with the SCSI subsystem:
lpfc 978944 81
nvmet_fc 32768 1 lpfc
nvme_fc 45056 1 lpfc
scsi_transport_fc 69632 1 lpfc
With the most recent versions of the driver it also provides an NVMe-oF initiator and target, so that NVMe equipment can be used when attached to an FC fabric.
Logging
Logging with an Emulex card can be done at the driver level as well as in the HBA firmware. Unless you get instructions to do so, leave the firmware logging as is, mainly because changing these parameters requires a reload of the driver, which then instructs the firmware logging facility to capture data in a host memory region. That data needs engineering effort to diagnose anyway, so it will not be very helpful to yourself or your OEM support organisation unless the case needs escalating to Emulex.
Changing the logging verbosity of the driver itself is much easier, but it may also incur some performance impact, so don't just flick on the "0xFFFFFFFF debug" button. The driver logging facility is a bitmap value based on the table below:
LOG Message Verbose Mask Definition | Verbose Bit | Verbose Description |
LOG_ELS | 0x00000001 | ELS events |
LOG_DISCOVERY | 0x00000002 | Link discovery events |
LOG_MBOX | 0x00000004 | Mailbox events |
LOG_INIT | 0x00000008 | Initialization events |
LOG_LINK_EVENT | 0x00000010 | Link events |
LOG_IP | 0x00000020 | IP traffic history |
LOG_FCP | 0x00000040 | FCP traffic history |
LOG_NODE | 0x00000080 | Node table events |
LOG_TEMP | 0x00000100 | Temperature sensor events |
LOG_BG | 0x00000200 | BlockGuard events |
LOG_MISC | 0x00000400 | Miscellaneous events |
LOG_SLI | 0x00000800 | SLI events |
LOG_FCP_ERROR | 0x00001000 | Log errors, not underruns |
LOG_LIBDFC | 0x00002000 | Libdfc events |
LOG_VPORT | 0x00004000 | NPIV events |
LOG_SECURITY | 0x00008000 | Security events |
LOG_EVENT | 0x00010000 | CT,TEMP,DUMP, logging |
LOG_FIP | 0x00020000 | FIP events |
LOG_FCP_UNDER | 0x00040000 | FCP underruns errors |
LOG_SCSI_CMD | 0x00080000 | ALL SCSI commands |
LOG_NVME | 0x00100000 | NVME general events |
LOG_NVME_DISC | 0x00200000 | NVME discovery/connect events |
LOG_NVME_ABTS | 0x00400000 | NVME ABTS events |
LOG_NVME_IOERR | 0x00800000 | NVME I/O Error events |
LOG_EDIF | 0x01000000 | External DIF events |
LOG_AUTH | 0x02000000 | Authentication events |
If you don't know what these mean, or have no clue how to interpret the output, it's not much use mucking around with these. The output will only confuse you, and if you don't know what the commands and responses should be it's only a bunch of hex values.
The values displayed above can be summed depending on which verbose logging needs to be enabled. For instance, if your OEM asks you for Link events, ELS and Initialization events, you may be asked to enable verbose logging with either "hbacmd" or via "sysfs". The value of the parameter will then be "0x19" (0x10 + 0x01 + 0x08).
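The summing described above can be sketched in a few lines of Python. This is only an illustration: the dictionary is copied from the table in this post, and the helper name build_mask is my own, not something the driver or hbacmd provides.

```python
# Bit values copied from the verbose-mask table above.
LPFC_LOG_BITS = {
    "LOG_ELS": 0x00000001,
    "LOG_DISCOVERY": 0x00000002,
    "LOG_MBOX": 0x00000004,
    "LOG_INIT": 0x00000008,
    "LOG_LINK_EVENT": 0x00000010,
    "LOG_IP": 0x00000020,
    "LOG_FCP": 0x00000040,
    "LOG_NODE": 0x00000080,
    "LOG_TEMP": 0x00000100,
    "LOG_BG": 0x00000200,
    "LOG_MISC": 0x00000400,
    "LOG_SLI": 0x00000800,
    "LOG_FCP_ERROR": 0x00001000,
    "LOG_LIBDFC": 0x00002000,
    "LOG_VPORT": 0x00004000,
    "LOG_SECURITY": 0x00008000,
    "LOG_EVENT": 0x00010000,
    "LOG_FIP": 0x00020000,
    "LOG_FCP_UNDER": 0x00040000,
    "LOG_SCSI_CMD": 0x00080000,
    "LOG_NVME": 0x00100000,
    "LOG_NVME_DISC": 0x00200000,
    "LOG_NVME_ABTS": 0x00400000,
    "LOG_NVME_IOERR": 0x00800000,
    "LOG_EDIF": 0x01000000,
    "LOG_AUTH": 0x02000000,
}


def build_mask(*names):
    """Combine the named log categories into one verbose mask (hypothetical helper)."""
    mask = 0
    for name in names:
        mask |= LPFC_LOG_BITS[name]  # OR equals summing, as every bit is distinct
    return mask


# Link events, ELS and Initialization events, as in the example above.
print(hex(build_mask("LOG_LINK_EVENT", "LOG_ELS", "LOG_INIT")))  # 0x19
```

Using a bitwise OR instead of a plain sum means that naming the same category twice does not corrupt the value.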
hbacmd or sysfs
If you have hbacmd installed, any change made to the logging preferences also automatically kicks off dracut and builds a new boot image. The command takes a few additional parameters:
hbacmd setdriverparam 10:00:00:90:fa:c7:cd:f9 G P log-verbose 0x135661
The first three are fairly obvious: the command sets a driver parameter for PWWN 10:xxxxxx. The G stands for Global, meaning it is valid for all adapters, and the P stands for Permanent, which ensures the parameter that follows is also applied after reboots. The log-verbose parameter is the setting we're adjusting. The 0x135661 is a combination of values obtained from the table above.
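If you are handed a value like 0x135661 and want to know what it switches on, you can split it back into its individual bits and look those up in the table. A minimal sketch; the function name set_bits is made up for this example:

```python
def set_bits(mask):
    """Return the individual bit values that are set in a 32-bit mask."""
    return [1 << i for i in range(32) if mask & (1 << i)]


# Print each bit in the same notation as the table.
for bit in set_bits(0x135661):
    print(f"0x{bit:08x}")
```

Looked up against the table, 0x135661 comes down to LOG_ELS, LOG_IP, LOG_FCP, LOG_BG, LOG_MISC, LOG_FCP_ERROR, LOG_VPORT, LOG_EVENT, LOG_FIP and LOG_NVME.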
The value can also be applied dynamically via sysfs in the "/sys/class/scsi_host/host<X>" directory (where <X> is the adapter ID). The lpfc driver creates its attribute files in that directory, one of which is the "lpfc_log_verbose" file. The 0x<123456> value can be echoed to that file and the driver will pick it up dynamically.
[root@centos8 host11]# cat lpfc_log_verbose
0x0
[root@centos8 host11]# echo 0x135661 > lpfc_log_verbose
The change is logged immediately:
Oct 8 17:03:15 centos8 kernel: lpfc 0000:81:00.0: 0:(0):3053 lpfc_log_verbose changed from 0 (x0) to 1267297 (x135661)
When you change all of them with
[root@centos8 scsi_host]# echo 0x135661 > host11/lpfc_log_verbose
[root@centos8 scsi_host]# echo 0x135661 > host12/lpfc_log_verbose
[root@centos8 scsi_host]# echo 0x135661 > host13/lpfc_log_verbose
[root@centos8 scsi_host]# echo 0x135661 > host14/lpfc_log_verbose
The message log will show something similar to this:
Oct 8 17:28:28 centos8 kernel: lpfc 0000:81:00.0: 0:(0):3053 lpfc_log_verbose changed from -1 (xffffffff) to 1267297 (x135661)
Oct 8 17:28:50 centos8 kernel: lpfc 0000:81:00.1: 1:(0):3053 lpfc_log_verbose changed from 1267297 (x135661) to 1267297 (x135661)
Oct 8 17:28:58 centos8 kernel: lpfc 0000:83:00.0: 2:(0):3053 lpfc_log_verbose changed from 1267297 (x135661) to 1267297 (x135661)
Oct 8 17:29:04 centos8 kernel: lpfc 0000:83:00.1: 3:(0):3053 lpfc_log_verbose changed from 1267297 (x135661) to 1267297 (x135661)
The interesting part is that the PCI paths to the adapters are used here. This is reflected in the "0000:81:00.0:", "0000:83:00.0:" etc. entries.
Remember that under normal circumstances you would not need to change these values. The basics are logged anyway, and only in specific circumstances would you need to adjust that. Also be aware that using a debug value of 0xFFFFFFFF can incur a significant performance overhead on busy systems, as a lot needs to be logged.
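If you do need to set the same value on every port, the per-host echo commands shown earlier can be wrapped in a small loop. The sketch below assumes the sysfs layout described in this post ("hostN/lpfc_log_verbose" under /sys/class/scsi_host); the function name and its sysfs_root parameter are my own, and writing to the real sysfs tree needs root privileges.

```python
from pathlib import Path


def set_lpfc_log_verbose(mask, sysfs_root="/sys/class/scsi_host"):
    """Write mask to every hostN/lpfc_log_verbose below sysfs_root.

    Hosts without that attribute (so not driven by lpfc) are skipped.
    Returns the names of the hosts that were changed.
    """
    changed = []
    for attr in sorted(Path(sysfs_root).glob("host*/lpfc_log_verbose")):
        attr.write_text(f"{mask:#x}\n")  # same effect as: echo 0x... > file
        changed.append(attr.parent.name)
    return changed


# Example (as root), remembering to set it back to 0x0 afterwards:
#   set_lpfc_log_verbose(0x19)
```

The sysfs_root parameter is only there so you can point the function at a scratch directory and try it out before touching a live system.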
Another thing I often get asked is which HBA port belongs to which SCSI host number.
Identification of the respective HBAs can be done by looking at the adapter entries in the event log as mentioned above. In this case the 81 and 83 values reflect the PCI bus IDs, and 00.0 and 00.1 are the individual ports on those adapters.
81:00.0 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
81:00.1 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
83:00.0 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
83:00.1 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
You can see these entries coming back in the /sys/class/fc_host directory, where symbolic links to the PCI devices are created:
lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host11 -> ../../devices/pci0000:80/0000:80:03.0/0000:81:00.0/host11/fc_host/host11
lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host12 -> ../../devices/pci0000:80/0000:80:03.0/0000:81:00.1/host12/fc_host/host12
lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host13 -> ../../devices/pci0000:80/0000:80:03.2/0000:83:00.0/host13/fc_host/host13
lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host14 -> ../../devices/pci0000:80/0000:80:03.2/0000:83:00.1/host14/fc_host/host14
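The host-to-PCI mapping in these links can also be pulled out with a short script rather than read by eye. This is a rough sketch that assumes the link layout shown above, where the last PCI address in the path is the HBA port; the function names are hypothetical.

```python
import os
import re

# Matches a full PCI address such as 0000:81:00.0 (domain:bus:device.function).
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def pci_address_of(link_target):
    """Return the last PCI address found in a sysfs path, or None."""
    matches = PCI_ADDR.findall(link_target)
    return matches[-1] if matches else None


def fc_host_map(fc_host_dir="/sys/class/fc_host"):
    """Map each hostN entry to the PCI address of its HBA port."""
    result = {}
    for host in sorted(os.listdir(fc_host_dir)):
        target = os.path.realpath(os.path.join(fc_host_dir, host))
        result[host] = pci_address_of(target)
    return result


link = "../../devices/pci0000:80/0000:80:03.0/0000:81:00.0/host11/fc_host/host11"
print(pci_address_of(link))  # 0000:81:00.0
```

Taking the last match matters because the path also contains the address of the upstream PCI bridge (0000:80:03.0 in my case).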
As soon as you know this you can associate the respective WWN of the adapter to the one you see on the switch:
[root@centos8 ~]# cat /sys/class/fc_host/host11/port_name
0x10000090fac7cde8
Sydney_ILAB_X6_43_TEST:FID43:admin> switchshow
switchName: Sydney_ILAB_X6_43_TEST
switchType: 165.0
<snip>
Index Slot Port Address Media Speed State Proto
============================================================
66 4 2 01ef40 id N8 Online FC F-Port 50:06:0e:80:10:13:b5:b8
<snip>
93 4 29 010000 id 16G Online FC F-Port 10:00:00:90:fa:c7:cd:e8
The above shows you where to look when you see errors on a SAN-attached disk, and how to associate the Emulex adapters with the respective WWNs on your SAN.
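One small annoyance when matching the two is that sysfs prints the port name as a single hex number while the switch shows colon-separated bytes. A minimal sketch of the conversion, with a function name of my own choosing:

```python
def format_wwn(sysfs_value):
    """Turn a sysfs port_name such as 0x10000090fac7cde8 into colon notation."""
    digits = sysfs_value.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16:
        raise ValueError(f"expected 16 hex digits, got {sysfs_value!r}")
    int(digits, 16)  # raises ValueError if the value is not valid hex
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


print(format_wwn("0x10000090fac7cde8"))  # 10:00:00:90:fa:c7:cd:e8
```

The result can be compared directly with the WWN column in the switchshow output.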
From there on you can also identify which disks are presented to that adapter. As you've seen above, the PCI subsystem creates a host interface per FC port. In my case these are host11 to host14.
A simple way to check is to just do an "ls" on the /sys/class/scsi_disk/*/device/block tree.
[root@centos8 scsi_disk]# ls */device/block/
<snip>
'11:0:0:0/device/block/':
sdm
<snip>
'11:0:0:8/device/block/':
sdu
As you can see, the 11:xxxxx entries list the respective "/dev/sd*" entries that are used for mounting volumes, MPIO listings etc.
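If you would rather have that listing grouped per HBA port, something along these lines could do it. It is a sketch that assumes the directory layout shown in the "ls" output (entries named host:channel:target:lun with a device/block/sdX below each); the function name and the scsi_disk_dir parameter are my own.

```python
from collections import defaultdict
from pathlib import Path


def disks_by_host(scsi_disk_dir="/sys/class/scsi_disk"):
    """Group block devices by SCSI host number.

    Returns {host_number: [(h:c:t:l, device_name), ...]}.
    """
    disks = defaultdict(list)
    for block in sorted(Path(scsi_disk_dir).glob("*/device/block/*")):
        hctl = block.parts[-4]            # e.g. "11:0:0:0"
        host = int(hctl.split(":")[0])    # first field is the host number
        disks[host].append((hctl, block.name))
    return dict(disks)


# Example: everything presented through host11
#   for hctl, dev in disks_by_host().get(11, []):
#       print(hctl, "/dev/" + dev)
```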
Obviously there are heaps of tools available to ease your troubleshooting efforts. I would advise installing the Emulex OCManager tools, which are provided as a free separate package and can be installed in agent or agent-less mode. Other tools like "lsblk", "blockdev", the sg-tools package and a few more are there to make your life a bit easier, so you don't have to crawl through the sysfs tree yourself.
Let me know if this was helpful. Your feedback is much appreciated.
Regards,
Erwin